FORTUNE-TELLING WITH PYTHON: AN INTRO TO TIME SERIES MODELING

Jonathan Balaban

Connect with me on LinkedIn and Github

Introduction

We often refer to our input features in machine learning as "dimensions". On that note, there's a dimension that pervades almost everything we do and observe as humans. It's the fourth dimension we experience every waking moment: time. But time is quite unlike other data we capture, and it often requires unique machine learning approaches. These models and approaches are well established in R and a few other languages, but have only more recently migrated to Python.

In regression and classification, we use features (collected during a cross-sectional study/survey/measurement) to predict an outcome. The model and parameters represent part of the underlying relationship between features and outcome. But what if we run out of funds to cross-section (it's possible), or need to predict future outcomes for which the features aren't measurable or don't yet exist?

Examples of time series data and modeling (constant time interval):

  • Hedge fund prediction of stock and index movements
  • Long and short-term weather forecasting
  • Business budgeting and trend analysis
  • Health vitals monitoring
  • Traffic flows and logistic optimization modeling
  • Can you think of others?

Time series data usually contains more than meets the eye, and can often be decomposed into trend, seasonal, and random fluctuation components (see the decomposition sketch after the list below).

  • Trends
    • Up
    • Down
    • Flat
    • Larger trends can be made up of smaller trends
    • There is no defined timeframe for what constitutes a trend; it depends on your goals
  • Seasonal Effects
    • Weekend retail sales spikes
    • Holiday shopping
    • Energy requirement changes with annual weather patterns
    • Note: Twitter spikes when news breaks are not seasonal; they aren't regular and predictable
  • Random Fluctuations
    • The human element
    • Aggregations of small influencers
    • Observation errors
    • The smaller this is in relation to Trend and Seasonal, the better we can predict the future
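
To make these components concrete, here's a minimal sketch (with a made-up monthly series) that builds a trend, a seasonal cycle, and noise, then recovers them with statsmodels' seasonal_decompose:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# toy monthly series: upward trend + annual cycle + noise (values are arbitrary)
idx = pd.date_range('2000-01-01', periods=120, freq='MS')
trend = np.linspace(10, 20, 120)
seasonal = 5 * np.sin(2 * np.pi * np.arange(120) / 12)
noise = np.random.normal(0, 1, 120)
toy = pd.Series(trend + seasonal + noise, index=idx)

# additive decomposition: Data = Trend + Seasonal + Random
sm.tsa.seasonal_decompose(toy, model='additive').plot();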

Time series models fall into two camps:

  • Additive
    • Data = Trend + Seasonal + Random
    • What we will be using for our modeling
  • Multiplicative
    • Data = Trend x Seasonal x Random
    • As easy to fit as Additive if we take the log
      • log(Data) = log(Trend x Seasonal x Random) = log(Trend) + log(Seasonal) + log(Random)

We should use multiplicative models when the percentage change of our data is more important than the absolute value change (e.g. stocks, commodities); as the trend rises and our values grow, we see amplitude growth in seasonal and random fluctuations. If our seasonality and fluctuations are stable, we likely have an additive model.
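
A quick sanity check on the log trick, using arbitrary made-up component values:

import numpy as np

# illustrative multiplicative components (values are arbitrary)
trend, seasonal, rand = 100.0, 1.2, 0.95
data = trend * seasonal * rand

# the log turns the product into a sum: log(Data) = log(Trend) + log(Seasonal) + log(Random)
print(np.isclose(np.log(data), np.log(trend) + np.log(seasonal) + np.log(rand)))  # True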

Time Series Modeling Process

Time series model selection is driven by the Trend and Seasonal components of our raw data. The general approach for analysis looks like this:

  1. Plot the data and determine Trends and Seasonality
    1. Difference or take the log of the data (multiple times if needed) to remove trends for certain model applications
    2. Stationarity is needed for ARMA models
  2. Determine if we have additive or multiplicative data patterns
  3. Select the appropriate algorithm based on the chart below
  4. Determine if model selection is correct with these tools
    • Ljung-Box Test
    • Residual errors (normally distributed with zero mean and constant variance, i.e. homoskedastic and i.i.d.)
    • Autocorrelation Function (ACF)
    • Partial Autocorrelation Function (PACF)
Algorithm                      Trend   Seasonal   Correlations
ARIMA                            X        X            X
SMA Smoothing                    X
Simple Exponential Smoothing     X
Seasonal Adjustment              X        X
Holt's Exponential Smoothing     X
Holt-Winters                     X        X

How to achieve and test for stationarity:

  • The mean of the series is not a function of time: $E[x_t] = \mu$ for all $t$

  • The variance of the series is not a function of time (homoscedasticity): $Var(x_t) = \sigma^2$ for all $t$

  • The covariance at different lags is not a function of time: $Cov(x_t, x_{t+k}) = \gamma_k$ depends only on the lag $k$, not on $t$

From A Complete Tutorial on Time Series Modeling in R

  • Info on stationarity
  • Plotting Rolling Statistics
    • Plot the moving average/variance and see if it changes with time. This visual technique can be done on different windows, but isn't as rigorously defensible as the test below.
  • Dickey-Fuller Test
    • Statistical test for checking stationarity; the null hypothesis is that the TS is non-stationary. If the test statistic is more negative than the critical value (equivalently, if the p-value falls below our chosen alpha), we can reject the null hypothesis and say that the series is stationary (see the sketch after the equations below).
$$ Y_t = \rho Y_{t-1} + \epsilon_t $$

$$ Y_t - Y_{t-1} = (\rho - 1) Y_{t-1} + \epsilon_t $$
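
A minimal sketch of the test with statsmodels (assuming `series` is a pandas Series; we'll wrap this in a fuller helper with rolling-statistic plots later in the notebook):

from statsmodels.tsa.stattools import adfuller

# adfuller returns (test statistic, p-value, lags used, n obs, critical values, ...)
stat, pvalue, *_ = adfuller(series, autolag='AIC')
print('ADF statistic: {:.3f}, p-value: {:.4f}'.format(stat, pvalue))
# a small p-value (e.g. < 0.05) lets us reject the null of non-stationarity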

In [7]:
#!pip install pyflux

import numpy as np
import pandas as pd
import pyflux as pf
import warnings
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [8]:
# create a play dataframe from 1-10 (linear and squared) to test how differencing works
play = pd.DataFrame([[x for x in range(1,11)], [x**2 for x in range(1,11)]]).T
play.columns = ['original', 'squared']
play


Out[8]:
original squared
0 1 1
1 2 4
2 3 9
3 4 16
4 5 25
5 6 36
6 7 49
7 8 64
8 9 81
9 10 100

In [9]:
# stationarize linear series (mean and variance don't change for sub-windows)
play.original.diff()


Out[9]:
0    NaN
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
Name: original, dtype: float64

In [11]:
# stationarize squared series
play.squared.diff().diff()


Out[11]:
0    NaN
1    NaN
2    2.0
3    2.0
4    2.0
5    2.0
6    2.0
7    2.0
8    2.0
9    2.0
Name: squared, dtype: float64

In [12]:
# take the log of the squared series to tame the growth (differencing would still be needed for stationarity)
np.log(play.squared)


Out[12]:
0    0.000000
1    1.386294
2    2.197225
3    2.772589
4    3.218876
5    3.583519
6    3.891820
7    4.158883
8    4.394449
9    4.605170
Name: squared, dtype: float64

Data Prep and EDA

We'll be looking at monthly average temperatures from 1907 to 1972.


In [15]:
# load data, recast columns if needed, convert to datetime
monthly_temp = pd.read_csv('./mean-monthly-temperature-1907-19.csv', skipfooter=2, engine='python',
                           header=0, index_col=0, names=['month', 'temp'])
monthly_temp.index = pd.to_datetime(monthly_temp.index)

In [17]:
# describe
monthly_temp.describe()


Out[17]:
temp
count 792.000000
mean 53.553662
std 15.815452
min 11.200000
25% 39.675000
50% 52.150000
75% 67.200000
max 82.400000

In [24]:
# resample to annual and plot each
annual_temp = monthly_temp.resample('A').mean()
monthly_temp.plot();
annual_temp.plot();



In [25]:
# plot both on same figure
plt.plot(monthly_temp)
plt.plot(annual_temp);



In [26]:
# plot with plotly (optional): might need a plotly account and key
import plotly.plotly as py
import plotly.graph_objs as go

data = [go.Scatter(x=annual_temp.index, y=annual_temp.temp)]
py.iplot(data)


Out[26]:

In [27]:
# plot binned yearly segments using resample method
monthly_temp.resample('A').temp.plot();



In [28]:
# violinplot months to determine variance and range
sns.violinplot(x=monthly_temp.index.month, y=monthly_temp.temp);


Are these datasets stationary? We can look at a few things per the list above, including a visual check (there seems to be a small upward trend in the annual, too hard to tell for monthly), a standard deviation check on various differences (smallest one is usually most stationary), and the formal Dickey-Fuller test.


In [31]:
# check monthly deviations for various diffs
print(monthly_temp.temp.std())
print(monthly_temp.temp.diff().std())
print(monthly_temp.temp.diff().diff().std())
print(monthly_temp.temp.diff().diff().diff().std())


15.815451540670221
9.455426987794672
9.252200624344837
14.726150786762322

In [32]:
# check annual deviations for various diffs
print(annual_temp.temp.std())
print(annual_temp.temp.diff().std())
print(annual_temp.temp.diff().diff().std())
print(annual_temp.temp.diff().diff().diff().std())


1.2621242173990008
1.7725607336526377
3.117841613811376
5.803232109414729

In [33]:
# define Dickey-Fuller Test (DFT) function
import statsmodels.tsa.stattools as ts
def dftest(timeseries):
    dftest = ts.adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
    # Determine rolling statistics
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)

In [34]:
# run DFT on monthly
dftest(monthly_temp.temp)


Test Statistic         -6.481466e+00
p-value                 1.291867e-08
Lags Used               2.100000e+01
Observations Used       7.700000e+02
Critical Value (5%)    -2.865301e+00
Critical Value (1%)    -3.438871e+00
Critical Value (10%)   -2.568773e+00
dtype: float64

In [35]:
# run DFT on annual
dftest(annual_temp.temp)


Test Statistic         -7.878242e+00
p-value                 4.779473e-12
Lags Used               0.000000e+00
Observations Used       6.500000e+01
Critical Value (5%)    -2.907154e+00
Critical Value (1%)    -3.535217e+00
Critical Value (10%)   -2.591103e+00
dtype: float64

ARIMA with Statsmodels

Enter Autoregressive Integrated Moving Average (ARIMA) modeling. When an outcome is autocorrelated with its own past (lagged) values, that relationship shows up in the outcome plot. We can model the relationship directly, allowing us to predict the future with a confidence level commensurate with the strength of the relationship and the proximity to known values (predictions weaken the further out we go).

Autoregressive Models

Autocorrelation: a variable's correlation with itself at different lags.

For second-order stationary (both mean and variance: $\mu_t = \mu$ and $\sigma_t^2 = \sigma^2$ for all $t$) data, autocovariance is expressed as a function only of the time lag $k$:

$$ \gamma_k = E[(x_t-\mu)(x_{t+k} - \mu)] $$

Therefore, the autocorrelation function is defined as:

$$ \rho_k = \frac{\gamma_k}{\sigma^2} $$

We use the plot of these values at different lags to determine optimal ARIMA parameters.

(Figure credit: Tomaschwutz - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=14740378)
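
To connect the formula to code, here's a small sketch that computes the lag-k autocorrelation by hand on an arbitrary simulated series and compares it to statsmodels' acf (the two estimators divide by slightly different sample sizes, so they agree closely rather than exactly):

import numpy as np
from statsmodels.tsa.stattools import acf

x = np.random.normal(size=500).cumsum()   # an arbitrary illustrative series
mu, var = x.mean(), x.var()

def rho(x, k):
    # sample autocovariance at lag k divided by the sample variance
    return np.mean((x[:-k] - mu) * (x[k:] - mu)) / var

print(rho(x, 1), acf(x, nlags=1, fft=False)[1])  # should be very close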

Some things to note:

  1. AR models propagate shocks infinitely
  2. Current random error is the epsilon error term
  3. If a process depends on previous values of itself, it is an AR process. If it depends on previous errors, it is an MA process.
  4. AR processes will exhibit exponential decay in the ACF and a cut-off in the PACF (see the simulated sketch below)
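
A simulated illustration of point 4 (the AR(1) coefficient of 0.7 is arbitrary):

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
import statsmodels.tsa.api as smt

np.random.seed(42)
# AR(1) with phi = 0.7; ArmaProcess uses the lag-polynomial form, so the AR coefficient is negated
ar1 = ArmaProcess(ar=[1, -0.7], ma=[1]).generate_sample(nsample=500)

smt.graphics.plot_acf(ar1, lags=20);    # geometric decay
smt.graphics.plot_pacf(ar1, lags=20);   # cuts off after lag 1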

Moving Average Models (This is NOT a Simple/Weighted/Exponential Moving Average)

Some things to note:

  1. MA models do not propagate shocks infinitely; they die after q lags
  2. All previous errors up to a lag are rolled into the epsilon error term for that period
  3. If a process depends on previous values of itself, it is an AR process. If it depends on previous errors, it is an MA process.
  4. MA processes will exhibit exponential decay in the PACF and a cut-off in the ACF (see the simulated sketch below)
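
And the mirror-image MA signature from point 4, again with an arbitrary coefficient:

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
import statsmodels.tsa.api as smt

np.random.seed(42)
# MA(1) with theta = 0.7
ma1 = ArmaProcess(ar=[1], ma=[1, 0.7]).generate_sample(nsample=500)

smt.graphics.plot_acf(ma1, lags=20);    # cuts off after lag 1
smt.graphics.plot_pacf(ma1, lags=20);   # decays gradually rather than cutting off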


In [36]:
# define helper plot function for visualization
import statsmodels.tsa.api as smt

def plots(data, lags=None):
    layout = (1, 3)
    raw  = plt.subplot2grid(layout, (0, 0))
    acf  = plt.subplot2grid(layout, (0, 1))
    pacf = plt.subplot2grid(layout, (0, 2))
    
    data.plot(ax=raw)
    smt.graphics.plot_acf(data, lags=lags, ax=acf)
    smt.graphics.plot_pacf(data, lags=lags, ax=pacf)
    sns.despine()
    plt.tight_layout()

In [37]:
# helper plot for monthly temps
plots(monthly_temp, lags=75);


Box-Jenkins Method

What the ACF shape indicates about the model:

  • Exponential, decaying to zero: autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model.
  • Alternating positive and negative, decaying to zero: autoregressive model. Use the partial autocorrelation plot to help identify the order.
  • One or more spikes, rest essentially zero: moving average model, order identified by where the plot becomes zero.
  • Decay starting after a few lags: mixed autoregressive and moving average (ARMA) model.
  • All zero or close to zero: data are essentially random.
  • High values at fixed intervals: include a seasonal autoregressive term.
  • No decay to zero: series is not stationary.

In [54]:
# we might need to install dev version for statespace functionality
#!pip install git+https://github.com/statsmodels/statsmodels.git
import statsmodels.api as sm

# fit SARIMA monthly based on helper plots
sar = sm.tsa.statespace.SARIMAX(monthly_temp.temp, order=(3,0,0), seasonal_order=(0,1,1,12), trend='c').fit()
sar.summary()


Out[54]:
Statespace Model Results
Dep. Variable: temp No. Observations: 792
Model: SARIMAX(3, 0, 0)x(0, 1, 1, 12) Log Likelihood -2126.970
Date: Wed, 03 May 2017 AIC 4265.939
Time: 17:24:55 BIC 4293.895
Sample: 0 HQIC 4276.691
- 792
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0113 0.008 1.509 0.131 -0.003 0.026
ar.L1 0.1711 0.036 4.741 0.000 0.100 0.242
ar.L2 0.0229 0.036 0.638 0.524 -0.048 0.093
ar.L3 0.0615 0.037 1.664 0.096 -0.011 0.134
ma.S.L12 -0.9998 2.178 -0.459 0.646 -5.269 3.270
sigma2 12.8275 27.819 0.461 0.645 -41.697 67.352
Ljung-Box (Q): 21.90 Jarque-Bera (JB): 242.83
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 0.70 Skew: -0.55
Prob(H) (two-sided): 0.00 Kurtosis: 5.50

In [51]:
# plot resids
plots(sar.resid, lags=40);



In [56]:
# plot residual diagnostics
sar.plot_diagnostics();



In [57]:
# plot predictions
monthly_temp['forecast'] = sar.predict(start = 750, end= 790, dynamic=False)  
monthly_temp[730:][['temp', 'forecast']].plot();


Statistical Tests

  • Normality (Jarque-Bera)

    • Null hypothesis is normally distributed residuals (good, plays well with RMSE and similar error metrics)
  • Serial correlation (Ljung-Box)

    • Null hypothesis is no serial correlation in residuals (independent of each other)
  • Heteroskedasticity

    • Tests for change in variance between residuals.
    • The null hypothesis is of no heteroskedasticity. That means different things depending on which alternative is selected:
      • Increasing: Null hypothesis is that the variance is not increasing throughout the sample; that the sum-of-squares in the later subsample is not greater than the sum-of-squares in the earlier subsample.
      • Decreasing: Null hypothesis is that the variance is not decreasing throughout the sample; that the sum-of-squares in the earlier subsample is not greater than the sum-of-squares in the later subsample.
      • Two-sided (default): Null hypothesis is that the variance is not changing throughout the sample. Both that the sum-of-squares in the earlier subsample is not greater than the sum-of-squares in the later subsample and that the sum-of-squares in the later subsample is not greater than the sum-of-squares in the earlier subsample.
  • Durbin Watson

    • Tests autocorrelation of residuals: we want a value between roughly 1 and 3; 2 is ideal (no serial correlation)

In [ ]:
# create and run statistical tests on model
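# a minimal sketch (assumes the fitted SARIMAX results object `sar` from above);
# the statespace results object exposes these tests directly
from statsmodels.stats.stattools import durbin_watson

print('Normality (Jarque-Bera):', sar.test_normality(method='jarquebera'))
print('Serial correlation (Ljung-Box):', sar.test_serial_correlation(method='ljungbox'))
print('Heteroskedasticity:', sar.test_heteroskedasticity(method='breakvar'))
print('Durbin-Watson (~2 means no serial correlation):', durbin_watson(sar.resid))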

Note on autofit methods

R has an auto.arima function (and other automagic methods) that gridsearches/optimizes our model parameters for us. Over time, more of these goodies are porting to Python (e.g. from statsmodels.tsa.x13 import x13_arima_select_order). While there's nothing wrong with utilizing these resources, the human makes the final determination! Don't become over-reliant on these methods, especially early on while you are grasping the underlying mechanics and theory!


In [ ]:
# autoselect for monthly, limited to only searching AR and MA parameters
autores = sm.tsa.arma_order_select_ic(monthly_temp.temp, ic=['aic', 'bic'], trend='c', max_ar=4, max_ma=4, fit_kw=dict(method='css-mle'))

print('AIC', autores.aic_min_order) # will use this as inputs for annual
print('BIC', autores.bic_min_order)

In [ ]:
# using itertools to gridsearch solutions
import itertools

#set parameter range; UPDATE THESE!
p = q = range(0, 3)
d = range(0, 2)
season = 12

# list of all parameter combos
pdq = list(itertools.product(p, d, q))
# same for seasonal variant
seasonal_pdq = [(x[0], x[1], x[2], season) for x in list(itertools.product(p, d, q))]

print('SARIMAX: {} , {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} , {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} , {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} , {}'.format(pdq[2], seasonal_pdq[4]))

# find optimal ARIMA for annual
warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            model = sm.tsa.statespace.SARIMAX(annual_temp,
                                              order=param,
                                              seasonal_order=param_seasonal,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False).fit()
            print('ARIMA{},{}12 - AIC:{}'.format(param, param_seasonal, model.aic))
        except:
            continue

Predicting Sunspots with Pyflux and ARIMA


In [58]:
# read and plot data
data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv')

plt.figure()
plt.plot(data.index,data['sunspot.year'])
plt.ylabel('Sunspots')
plt.title('Yearly Sunspot Data');



In [59]:
# fit and summarize model
model = pf.ARIMA(data=data,ar=4,ma=4,integ=0,target='sunspot.year')
x = model.fit("MLE")
x.summary()


Normal ARIMA(4,0,4)                                                                                       
======================================================= ==================================================
Dependent Variable: sunspot.year                        Method: MLE                                       
Start Date: 4                                           Log Likelihood: -1180.5324                        
End Date: 288                                           AIC: 2381.0648                                    
Number of observations: 285                             BIC: 2417.5897                                    
==========================================================================================================
Latent Variable                          Estimate   Std Error  z        P>|z|    95% C.I.                 
======================================== ========== ========== ======== ======== =========================
Constant                                 2.6356     1.0926     2.4122   0.0159   (0.494 | 4.7772)         
AR(1)                                    1.8152     0.1034     17.558   0.0      (1.6126 | 2.0178)        
AR(2)                                    -0.6074    0.2658     -2.2847  0.0223   (-1.1284 | -0.0863)      
AR(3)                                    -0.866     0.2705     -3.2013  0.0014   (-1.3961 | -0.3358)      
AR(4)                                    0.608      0.1073     5.6651   0.0      (0.3976 | 0.8183)        
MA(1)                                    -0.6373    0.0957     -6.659   0.0      (-0.8249 | -0.4498)      
MA(2)                                    -0.5622    0.1495     -3.7593  0.0002   (-0.8553 | -0.2691)      
MA(3)                                    0.2918     0.1206     2.4206   0.0155   (0.0555 | 0.5282)        
MA(4)                                    0.2367     0.0684     3.4618   0.0005   (0.1027 | 0.3706)        
Normal Scale                             15.2303                                                          
==========================================================================================================

In [60]:
# plot z-scores of feature coefficients
model.plot_z(indices=range(1,9))



In [61]:
# plot model
model.plot_fit(figsize=(15,5))



In [62]:
# plot in sample
model.plot_predict_is(50, figsize=(15,5))



In [63]:
# plot forecast
model.plot_predict(h=20,past_values=20, figsize=(15,5))



In [64]:
# compare tail to prediction
data.tail()


Out[64]:
Unnamed: 0 time sunspot.year
284 285 1984 45.9
285 286 1985 17.9
286 287 1986 13.4
287 288 1987 29.2
288 289 1988 100.2

In [66]:
model.predict(h=12)


Out[66]:
sunspot.year
289 138.464513
290 154.282602
291 140.500016
292 113.826086
293 74.497439
294 40.860413
295 18.408237
296 15.923025
297 30.266495
298 56.804989
299 84.768266
300 105.477192

ARIMA with Pyflux: Student Practice


In [ ]:
# helper plot

In [ ]:
# build and summarize model

In [ ]:
# plot z-scores of feature coefficients

In [ ]:
# plot model against raw data

In [ ]:
# plot predictions

In [ ]:
# plot forecast

In [ ]:
# predict future values

CO2 Levels and Forecasting

Statsmodels has some 'built-in' time series datasets to play with, including one that tracks atmospheric CO2 from continuous air samples at Mauna Loa Observatory in Hawaii. This data includes CO2 samples from March 1958 to December 2001.

Credits


In [ ]:
# load data
data = sm.datasets.co2.load_pandas()
co2 = data.data

In [ ]:
# resample to monthly and check missing values
co2 = co2['co2'].resample('MS').mean()

co2 = co2.fillna(co2.bfill())
co2.isnull().sum()

In [ ]:
# decompose data into trend, seasonal, and residual
decomposition = sm.tsa.seasonal_decompose(co2, model='additive')
fig = decomposition.plot()
plt.show()

# optimize our SARIMAX model using itertools: CAREFUL, this can take a while
warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            model = sm.tsa.statespace.SARIMAX(co2,
                                              order=param,
                                              seasonal_order=param_seasonal,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False).fit()
            print('ARIMA{},{} - AIC:{}'.format(param, param_seasonal, model.aic))
        except:
            continue

In [ ]:
# build model
co2sar = sm.tsa.statespace.SARIMAX(co2, order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False).fit()
co2sar.summary()

In [ ]:
# check diagnostics
co2sar.plot_diagnostics();

In [ ]:
# create predictions and confidence intervals
pred = co2sar.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False) # we use as many true values as possible to predict
pred_ci = pred.conf_int()

In [ ]:
# plot predictions
ax = co2['1993':].plot(label='Observed CO2 Levels')
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=.8) # this is using all available info

ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1], color='k', alpha=.1)

ax.set_xlabel('Year')
ax.set_ylabel('CO2')
plt.legend()

plt.show();

In [ ]:
# compute mean square error
fcast = pred.predicted_mean
true = co2['1998-01-01':]

mse = ((fcast - true) ** 2).mean()
print('MSE of our forecasts is {}'.format(round(mse, 3)))

In [ ]:
# dynamic forecast
fcast = co2sar.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True, full_results=True)
fcast_ci = fcast.conf_int()

# caution: this is modeling assumptions on top of assumptions

In [ ]:
# student practice: use template from above
# plot predictions

In [ ]:
# compute mean square error

In [ ]:
# forecast next 100 months and get confidence interval

In [67]:
# plot forecast

Closing

  • Which areas of time series modeling are most interesting?
  • Which concepts are most challenging?
  • How will you apply these concepts to your work or personal projects?
  • Have you used other packages or tools that you've found helpful?

For more, check out the Facebook Prophet and Pyflux projects.

Connect with me on LinkedIn and Github